-
Notifications
You must be signed in to change notification settings - Fork 199
FEAT - Adding decimal as parameter for ToFloat32
#1772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
FEAT - Adding decimal as parameter for ToFloat32
#1772
Conversation
decimal as parameter for ToFloat32
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
skrub/tests/test_to_float.py
Outdated
| (",56", 0.56, ","), | ||
| ], | ||
| ) | ||
| def test_number_parsing(input_str, expected_float, decimal, df_module): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be worth adding tests for the code's behaviour in case of an invalid entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we should check a few weird cases and make sure they fail as expected
|
After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release. The current implementation is removing all thousands separators other than what is specified as the "decimal" separator, which is quite risky and may leads to problems. It's better to follow what pandas is doing, i.e., have both If there is some kind of weird string like Another check that may be considered is counting the number of Some additional comments:
I'll convert this back to draft and keep an eye on this for the next PR. |
|
When talking about the tests, it was mentioned to include 3 tests:
|
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
| During ``fit``, |ToFloat| attempts to convert all values in the column to | ||
| numeric values after automatically removing other possible thousands separators | ||
| (``,``, ``.``, space, apostrophe). If any value cannot be converted, the column | ||
| is rejected with a ``RejectColumn`` exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.
There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
skrub/tests/test_to_float.py
Outdated
| ("1,,234", ".", ","), | ||
| ("1.23,45", ".", ","), | ||
| # decimal == thousand | ||
| ("123,456,789", ",", ","), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are testing that RejectColumn is raised as expected when it encounters values that should not be converted. This case should be moved to a separate test that verifies that the correct exception is raised if the parameters are incorrect. The same (new) test should also check that a ValueError is raised if decimal is None.
|
Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale. My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in a good shape, it's just a matter of polishing at this point. |
skrub/_to_float.py
Outdated
| if self.thousand is None: | ||
| self.thousand = "" # No thousand separator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to the init, parameters should not be modified in the fit
…ical.rst Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
…ical.rst Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
…ical.rst Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
5e1dd9b to
095f403
Compare
| def _str_is_valid_number_polars(col, number_re): | ||
| # Check if all values in the column match the number regex. | ||
| # - Fill NaN values with empty string to avoid match errors. | ||
| # - Use `str.match` with `na=False` to treat empty/missing values as non-matching. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # - Use `str.match` with `na=False` to treat empty/missing values as non-matching. | |
| # - Use `str.contains` with `literal=False` to treat empty/missing values as non-matching. |
rcap107
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more cosmetic fixes, but I think we're almost done here. Thanks @gabrielapgomezji
| strings to floats. Other possible decimal separators are removed from | ||
| the strings before conversion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not the case anymore
| strings to floats. Other possible decimal separators are removed from | |
| the strings before conversion. | |
| strings to floats. |
| 1 12300.0 | ||
| Name: x, dtype: float32 | ||
| It is possible to specify the thousands separator, e.g., to use " " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| It is possible to specify the thousands separator, e.g., to use " " | |
| It is possible to specify the thousands separator, e.g., to use ``" "`` |
| It is possible to specify the thousands separator, e.g., to use " " | ||
| >>> s = pd.Series(["4 567,89", "12 567,89"], name="x") | ||
| >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ELLIPSIS is enabled by default
| >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS | |
| >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) |
|
I did a very quick and dirty benchmark comparing the performance of the ToFloat transformer in the latest version of skrub, and the ToFloat in this PR. I generated a dataframe with 10M rows and 30 columns. There is a small, but noticeable difference in time when there is no conversion to be done, so we might want to add a condition where the formatting check is skipped in the default case (decimal="." and thousands=None). Codeimport time
import polars as pl
import numpy as np
import skrub
from importlib import metadata
import polars.selectors as cs
version = metadata.version("skrub")
print(f"skrub version: {version}")
# Set random seed for reproducibility
np.random.seed(42)
# Generate random float data
data = np.random.uniform(1000, 1000000, size=(10_000_000, 30))
# Convert floats to strings with space as thousands separator
df = pl.DataFrame(data)
from skrub import ToFloat, ApplyToCols
def benchmark_tofloat(df):
tic = time.time()
transformer = ApplyToCols(ToFloat())
transformed_df = transformer.fit_transform(df)
toc = time.time()
return toc - tic
# Run benchmark
times = []
for run in range(100):
elapsed_time = benchmark_tofloat(df)
times.append(elapsed_time)
elapsed_time = np.median(times)
print(f"Elapsed time for ToFloat transformation: {elapsed_time:.4f} seconds") |
This issue addresses #1728
The ToFloat transformer now includes a
decimalparameter that lets the user specify the decimal separator to use for the given column. Then, all the possible thousands separators are removed, and the decimal separator is converted to a.before the column is passed toto_float32.